Add optional DuckDB connectors prototype (waterdata + wqp) by thodson-usgs · Pull Request #241 · DOI-USGS/dataretrieval-python

thodson-usgs · 2026-04-25T22:59:25Z

Summary

Adds dataretrieval.duckdb_connectors, an optional extension that wraps DuckDB connections with helper methods exposing the waterdata (OGC) and wqp endpoints as registerable SQL views. Each helper returns a duckdb.DuckDBPyRelation; pass register_as=<name> to also publish the result as a named view that subsequent SQL can reference.

This is a prototype / draft — opening for early feedback on shape and coverage before any release. Status: 15 tests passing, ruff check + ruff format --check clean, demo notebook executes end-to-end against the live API.

Why

Several use cases get noticeably more ergonomic in SQL than in pandas:

Joining heterogeneous endpoints — site metadata, daily values, time-series metadata, water-quality results all live behind separate getters; once registered as views they JOIN cleanly.
Window functions / time aggregations — top-N flow days per gauge, monthly means, rolling windows.
Compose with files on disk — DuckDB reads Parquet/CSV/S3 natively, so cached water data can be joined with external datasets without leaving SQL.

What's in the box

from dataretrieval.duckdb_connectors import waterdata, wqp

with waterdata.connect() as wd:
    wd.monitoring_locations(state_name="Illinois", register_as="sites")
    wd.daily(monitoring_location_id=["USGS-05586100"], parameter_code="00060",
             time="2023-01-01/2023-12-31", register_as="daily")
    wd.sql("""
        SELECT s.monitoring_location_name, avg(d.value) AS mean_cfs
        FROM sites s JOIN daily d USING (monitoring_location_id)
        GROUP BY 1 ORDER BY 2 DESC
    """)

Layout

dataretrieval/duckdb_connectors/
├── _base.py         # _require_duckdb, _flatten_geometry, _BaseConnection
├── waterdata.py     # WaterdataConnection + connect()
└── wqp.py           # WQPConnection + connect(legacy=, ssl_check=)

dataretrieval/duckdb_connector.py (singular) is preserved as a backward-compat alias re-exporting the waterdata connector.

Endpoints exposed

waterdata: monitoring_locations, daily, continuous, time_series_metadata, latest_continuous, latest_daily, field_measurements, samples
wqp: get_results, what_sites, what_organizations, what_projects, what_activities, what_detection_limits, what_habitat_metrics, what_activity_metrics (connection holds a legacy / ssl_check default; per-call overrides supported)
Generic: register_table(name, fn, **kwargs) for any (DataFrame, metadata)-returning getter not yet wrapped.

Optional dependencies

Added to pyproject.toml:

duckdb  = ["duckdb>=1.0.0"]
spatial = ["dataretrieval[nldi]", "dataretrieval[duckdb]"]   # compound extra

The DuckDB spatial extension is a runtime C++ binary that DuckDB downloads on first INSTALL spatial — not a pip package. Pass spatial=True to connect() to install + load it; registered geometry columns (stored as WKT) can then be parsed with ST_GeomFromText.

pyproject.toml tool.ruff.target-version was bumped to py310 (waterdata tests already gate at 3.10).

Geometry handling

Default: GeoDataFrame geometry is flattened to WKT plus longitude/latitude columns so the connectors work without the spatial extension. With spatial=True, native ST_* functions are available against the same WKT column.

Demo

demos/duckdb_waterdata_demo.ipynb walks through site discovery, daily values, monthly aggregation, top-N window functions, sites×daily join, latest readings, a cross-source waterdata × WQP join, and the spatial-extension affordance.

Test plan

pytest tests/duckdb_connectors_waterdata_test.py tests/duckdb_connectors_wqp_test.py — 15 passing (mocked at the getter boundary; spatial test skips if extension can't be fetched)
ruff check + ruff format --check clean
Live-tested every notebook section against api.waterdata.usgs.gov + waterqualitydata.us with an API token
Backward-compat: from dataretrieval import duckdb_connector; duckdb_connector.connect() still works
Decide whether NOTES.md (design notes captured during prototyping) belongs in the repo or should be dropped before merge
Decide whether to widen coverage to nldi / ngwmn

Known follow-ups

Hot-path import geopandas cleanup — already applied in this PR; will look for analogous spots elsewhere in the codebase in a follow-up branch.
DuckDB ST_Distance_Sphere axis convention surprised me during the demo; switched the demo to ST_Within with a polygon, which is unambiguous. Worth filing upstream once confirmed against duckdb-spatial docs.

🤖 Generated with Claude Code

Adds dataretrieval.duckdb_connectors, an optional extension that wraps DuckDB connections with helper methods exposing the dataretrieval waterdata (OGC) and wqp endpoints as registerable SQL views. Each helper returns a duckdb.DuckDBPyRelation; pass register_as=<name> to also publish the result as a named view that subsequent SQL can reference. Highlights: * Per-source layout (duckdb_connectors/{waterdata,wqp}.py) sharing a thin _BaseConnection in _base.py * Optional dependency: pip install dataretrieval[duckdb] * Compound spatial extra (pip install dataretrieval[spatial]) bundles geopandas + duckdb; spatial=True flag on connect() runs INSTALL spatial; LOAD spatial on the underlying connection so ST_GeomFromText etc. become available against registered views * Geometry handled by converting GeoDataFrame geometry to WKT plus longitude/latitude columns so the prototype works without the spatial extension by default * WQP connector threads connection-level legacy / ssl_check defaults through to every helper; per-call overrides supported * dataretrieval.duckdb_connector preserved as a backward-compat alias for the waterdata connector * Demo notebook covering site discovery, daily values, monthly aggregation, top-N window functions, sites x daily joins, latest readings, cross-source waterdata x WQP joins, and the spatial flag 15 tests pass; ruff check + format clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Narrow `_flatten_geometry` exception from `Exception` to `ValueError` (the actual error geopandas raises on `.x`/`.y` for non-Point geometries) so genuine bugs aren't swallowed. * Drop the `if TYPE_CHECKING: import duckdb as _duckdb` block in `_base.py`. The runtime `try: import duckdb` is enough for type checkers; the alias was dead code. * Parametrize `test_what_endpoint_invokes_correct_underlying` so each WQP helper gets its own test case (matches the parametrize pattern already used in `tests/waterdata_utils_test.py`). 20 tests pass; ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

elbeejay · 2026-04-26T13:09:44Z

I get the overall idea here, but I think there's some value in revisiting the overall purpose and scope of the package. My understanding was that dataretrieval is meant to be an atomic Python API wrapper for USGS water data. Every addition of a feature beyond the that base functionality increases the maintenance burden. I'd wonder if something like DuckDB support could be a separate library that wraps the core dataretrieval.

thodson-usgs · 2026-04-26T13:20:38Z

nice to hear from you @elbeejay. I agree with you, and I'm leaning against merging this PR. I was curious whether duckdb might be useful, but quickly realized that the query language isn't all that important when a bot is writing your queries.

thodson-usgs and others added 2 commits April 25, 2026 17:58

thodson-usgs closed this Apr 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add optional DuckDB connectors prototype (waterdata + wqp)#241

Add optional DuckDB connectors prototype (waterdata + wqp)#241
thodson-usgs wants to merge 2 commits intoDOI-USGS:mainfrom
thodson-usgs:worktree-duckdb-connector

thodson-usgs commented Apr 25, 2026

Uh oh!

elbeejay commented Apr 26, 2026

Uh oh!

thodson-usgs commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

thodson-usgs commented Apr 25, 2026

Summary

Why

What's in the box

Layout

Endpoints exposed

Optional dependencies

Geometry handling

Demo

Test plan

Known follow-ups

Uh oh!

elbeejay commented Apr 26, 2026

Uh oh!

thodson-usgs commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants